This notebook and the contents of this directory were copied from https://github.com/genepattern/single_cell_clustering_notebook.git and adapted for running on ScienceData.

Single-Cell RNA-seq Clustering Analysis Notebook¶

Authors - Clarence K. Mah, Alex T. Wenzel, and Edwin F. Juárez

Contact Email - ejuarez@ucsd.edu

This notebook is described in https://f1000research.com/articles/7-1306/v2

Summary¶

The goal of this notebook is to provide a standard single-cell RNA-seq analysis workflow for pre-processing, identifying sub-populations of cells by clustering, and exploring biomarkers to explain intra-population heterogeneity. The workflow is modeled after the Seurat Guided Clustering Tutorial^{[[1]](https://www.nature.com/articles/nbt.3192)} and performs all analyses using the scanpy^{[[2]](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0)} library.

Use Case¶

Single-cell RNA sequencing (scRNA-seq) has emerged as a popular method to profile gene expression at the resolution of individual cells. Compared to traditional RNA-seq collected from bulk cells or tissue, scRNA-seq enables users to capture cell-by-cell transcriptomic variability. This information can then be used to define and characterize heterogeneity within a population of cells, from identifying known cell types to discovering novel ones. Here we examine a peripheral blood mononuclear cells (PBMCs) dataset, which consists of many immune cell types including T-cells, B-cells, NK-cells, macrophages, and dendritic cells.

Dataset¶

This notebook analyzes a dataset of 3K Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor available from 10X Genomics, sequenced on the Illumina NextSeq 500.

Matrix Data File: 10X Genomics formatted single-cell RNA-seq count matrix of 2,700 single PBMCs

Analysis Overview¶

diagram source

Table of Contents¶

Setup Analysis
1. Load raw count matrix.
Preprocess Counts
1. Filter cells based on QC metrics.
2. Perform log scaling and normalization to total counts.
3. Remove unwanted sources of variation (number of detected molecules per cell as well as the percentage mitochondrial gene content).
4. Detect highly variable genes.
5. Perform linear dimensional reduction (PCA).
Cluster Cells
1. Cluster cells (graph-based clustering) in PCA space and visualize using t-SNE.
Visualize Cluster Markers
1. Explore and visualize cluster markers interactively.
Export Analysis Data
1. Export data to .csv files or a compressed .h5ad format.

Step 1: Setup Analysis¶

First set up the needed environment. We disable ssl certificate check in case you're gonna work with your data on ScienceData over the private network, i.e. connect to https://sciencedata/files/.

#%load_ext autoreload
#%autoreload 2

import os
import subprocess

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import requests

def download(url):
    fileName = os.path.basename(url)
    if fileName and not os.path.isfile(fileName):
        subprocess.check_output(['wget',url])

def upload(file, folderUrl):
    requests.put(file, folderUrl+'/'+file)

#scriptUrl = 'https://raw.githubusercontent.com/genepattern/single_cell_clustering_notebook/master/singlecell.py'
scriptUrl = 'https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/singlecell.py'
requirementsUrl = 'https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/requirements.txt'

download(scriptUrl)
download(requirementsUrl)

os.system("pip show louvain 2>/dev/null > /dev/null; if [ $? -ne 0 ]; then pip install setuptools==57.5.0; pip install -r requirements.txt; fi");

0

Load raw count data for a single-cell RNA-seq experiment.

This notebook accepts either a single matrix file that includes gene and cell identifiers (as the first column and row respectively) or the three-file output from the 10X Genomics single-cell pipeline.

The single matrix file can be in the following formats: csv, txt, tsv, tab, data, h5, h5ad, loom

Instructions

If you have a single matrix file that includes gene and cell (barcode) labels, upload or paste the link to it in the first field below: "matrix data file"
If you have the three files from the 10X Genomics pipeline (matrix.mtx, genes.tsv, barcodes.tsv), upload them or provide a link to each file in the corresponding fields: 10x mtx data file, 10x gene name file, 10x barcodes file

NOTE:

NOTE FOR DEVELOPERS:

For data with 2,700 cells and 33,000 genes, this step will run in 10 to 20 seconds (depending on Internet speed)

Below you can either download data from 10x, unpacked pbmc3k_filtered_gene_bc_matrices.tar.gz and uploaded filtered_gene_bc_matrices to a non-public folder tmpon ScienceData; then uncomment the relevant line. Or you can leave as is and use provided example data.

# Uncomment this to use public example data from ScienceData
data_folder = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/data/"
# Uncomment this to use data you've downloaded to ScienceData from 10x
#data_folder = "https://sciencedata/files/tmp/"
# Uncomment this to use example data from the notebook authors
#data_folder = "https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/"

matrixDataFile = ""
MTXDataFile = data_folder+"matrix.mtx"
geneNameFile = data_folder+"genes.tsv"
barCodesFile = data_folder+"barcodes.tsv"

# Workaround for bug in GPNB 0.7.2

from IPython.display import display, Javascript

display(Javascript("""

setTimeout(function() {

    $(".gp-widget-call").each(function(i, e) {

        const widget = $(e).data("widget");

        widget.options.append_output = false;

    })

}, 1000);

"""))

import nbtools
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

import sys

sys.path.append('.')

import genepattern
from singlecell import SingleCellAnalysis, sc
import logging
logging.captureWarnings(True)

sc.settings.verbosity = 0

logging.disable(logging.INFO)


%matplotlib inline


# Workaround for bug in GPNB 0.7.2

from IPython.display import display, Javascript

display(Javascript("""

setTimeout(function() {

    $(".gp-widget-call").each(function(i, e) {

        const widget = $(e).data("widget");

        widget.options.append_output = false;

    })

}, 1000);

"""))

At this point, save the notebook and reload the web page in order to have the javascript generate the UI elements.

analysis = SingleCellAnalysis()

genepattern.GPUIBuilder(

    analysis.setup_analysis,

    function_import='analysis.setup_analysis',

    name='Setup Analysis',

    parameters={

        'csv_filepath': {

            'name': 'matrix data file',

            'description': 'Provide your data file either as a URL or local file path. See above instructions for supported formats.',

            'default': matrixDataFile,

            'type': 'file',

            'kinds': ['csv', 'xlsx', 'txt', 'tsv', 'tab', 'data', 'h5', 'h5ad', 'soft.gz', 'txt.gz', 'anndata', 'mtx', 'zip'],

        },

        'gene_x_cell': {

            'name': 'Matrix dimension names',

            'description': 'If the rows of your matrix are genes, choose "gene by cell", otherwise choose "cell by gene"',

            'default': True,

            'type': 'choice',

            'choices': {

                "gene by cell": True,

                "cell by gene": False

            }

        },

        'mtx_filepath': {

            'name': '10x .mtx data file',

            'description': 'If you have data from 10X, upload or link your .mtx file here.',

            'default': MTXDataFile,

            'choices': {'2700 PBMCs from a Healthy Donor (example) - mtx': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/matrix.mtx'},

            'type': 'file',

            'kinds': ['mtx', 'zip']

        },

        'gene_filepath': {

            'name': '10x gene name file',

            'description': 'If you have data from 10X, upload or link your genes.tsv file here',

            'default': geneNameFile,

            'choices': {'2700 PBMCs from a Healthy Donor (example) - genes': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/genes.tsv'},

            'type': 'file',

            'kinds': ['tsv', 'zip']

        },

        'bc_filepath': {

            'name': '10x barcodes file',

            'description': 'If you have data from 10X, upload or link your barcodes.tsv file here',

            'default': barCodesFile,

            'choices': {'2700 PBMCs from a Healthy Donor (example) - barcodes': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/barcodes.tsv'},

            'type': 'file',

            'kinds': ['tsv', 'zip']

        },

        

        'output_var': {

            'hide': True

        }

    }

)

Step 2: Preprocess Counts¶

genepattern.GPUIBuilder(

    analysis.preprocess_counts,

    function_import='analysis.preprocess_counts',

    name='Preprocess Counts',

    parameters={

        'data': {

            'description': 'Output from the "Setup Analysis" tool.',

            'default': 'analysis'

        },

        

        'min_n_cells': {

            'name': 'filter genes (min. # of cells)',

            'description': 'Include genes expressed in at least this many cells. Blank will be treated as 0.',

            'type': 'number',

            'default': 3

        },

        

        'min_n_genes': {

            'name': 'filter cells (min. # of genes)',

            'description': 'Include cells with at least this many genes. Blank will be treated as 0.',

            'type': 'number',

            'default': 200

        },

        'max_n_genes': {

            'name': 'filter cells (max # of genes)',

            'description': 'Include cells with at most this many genes. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 1700

        },

        'min_n_counts': {

            'name': 'filter cells (min. total counts)',

            'description': 'Include cells with at least this many counts. Blank will be treated as 0.',

            'type': 'number',

            'default': 0

        },

        'max_n_counts': {

            'name': 'filter cells (max total counts)',

            'description': 'Include cells with at most this many counts. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 5700

        },

        'min_percent_mito': {

            'name': 'filter cells (min. % mito. genes)',

            'description': 'Include cells with at least this % of reads mapped to mitochondrial genes. Blank will be treated as 0.',

            'type': 'number',

            'default': 0

        },

        'max_percent_mito': {

            'name': 'filter cells (max % mito. genes)',

            'description': 'Include cells with at most this % of reads mapped to mitochondrial genes. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 6

        },

        'normalization_method': {

            'name': 'log normalize',

            'description': 'Perform log normalization on the data.',

            'choices': {'Yes': 'LogNormalize', 'No': ''}

        },

        'do_regression': {

            'name': 'perform regression',

            'description': 'Regress out effect of cell counts and percent mitochondrial genes',

            'choices': {'Yes': True, 'No': False}

        },

        'output_var': {

            'hide': True

        }

    })

updating: analysis/ (stored 0%)
updating: analysis/uns/ (stored 0%)
updating: analysis/uns/louvain.csv (deflated 4%)
updating: analysis/uns/rank_genes_groups.csv (deflated 65%)
updating: analysis/uns/neighbors.csv (deflated 4%)
updating: analysis/uns/pca.csv (deflated 58%)
updating: analysis/var.csv (deflated 49%)
updating: analysis/obs.csv (deflated 63%)
updating: analysis/varm.csv (deflated 61%)
updating: analysis/X.csv (deflated 89%)
updating: analysis/obsm.csv (deflated 55%)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6732k    0     0  100 6732k      0  10.4M --:--:-- --:--:-- --:--:-- 10.4M

0

genepattern.GPUIBuilder(

    analysis.preprocess_counts,

    function_import='analysis.preprocess_counts',

    name='Preprocess Counts',

    parameters={

        'data': {

            'description': 'Output from the "Setup Analysis" tool.',

            'default': 'analysis'

        },

        

        'min_n_cells': {

            'name': 'filter genes (min. # of cells)',

            'description': 'Include genes expressed in at least this many cells. Blank will be treated as 0.',

            'type': 'number',

            'default': 3

        },

        

        'min_n_genes': {

            'name': 'filter cells (min. # of genes)',

            'description': 'Include cells with at least this many genes. Blank will be treated as 0.',

            'type': 'number',

            'default': 200

        },

        'max_n_genes': {

            'name': 'filter cells (max # of genes)',

            'description': 'Include cells with at most this many genes. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 1700

        },

        'min_n_counts': {

            'name': 'filter cells (min. total counts)',

            'description': 'Include cells with at least this many counts. Blank will be treated as 0.',

            'type': 'number',

            'default': 0

        },

        'max_n_counts': {

            'name': 'filter cells (max total counts)',

            'description': 'Include cells with at most this many counts. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 5700

        },

        'min_percent_mito': {

            'name': 'filter cells (min. % mito. genes)',

            'description': 'Include cells with at least this % of reads mapped to mitochondrial genes. Blank will be treated as 0.',

            'type': 'number',

            'default': 0

        },

        'max_percent_mito': {

            'name': 'filter cells (max % mito. genes)',

            'description': 'Include cells with at most this % of reads mapped to mitochondrial genes. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 6

        },

        'normalization_method': {

            'name': 'log normalize',

            'description': 'Perform log normalization on the data.',

            'choices': {'Yes': 'LogNormalize', 'No': ''}

        },

        'do_regression': {

            'name': 'perform regression',

            'description': 'Regress out effect of cell counts and percent mitochondrial genes',

            'choices': {'Yes': True, 'No': False}

        },

        'output_var': {

            'hide': True

        }

    })

Step 3: Cluster Cells¶

Cluster cells using the Louvain method for community detection on `N` principal components. Then use t-SNE to visualize cells, using `N` principal components.

It is important to note that the fewer PCs we choose to use, the less noise we have when clustering, but at the risk of excluding relevant biological variance. Look at the plot in Step 2 showing the percent variance explained by each principle component and choose a cutoff where there is a clear elbow in the graph. The N principal components to the left of this cutoff are then used to cluster the cells.

Instructions

Perform clustering and visualization of the cells using the sliders to tune parameters.

For 2,600 cells and 10 principal components, this step will run in approximately 60 to 90 seconds

genepattern.GPUIBuilder(

    analysis.cluster_cells,

    function_import='analysis.cluster_cells',

    name='Open Cell Clustering Interface',

    description='This step outputs an interactive interface to cluster cells.',
    
    parameters={
        'output_var': {
            'hide': True
        }
    })

Step 4: Visualize Cluster Markers¶

Visualization the expression of markers on the clustering plot.

Explore genes that are differentially expressed in clusters as potential biomarkers. We provide the Wilcoxon rank-sum test (default) and t-test as test options. The non-parametric Wilcoxon rank-sum test generally performs better for single-cell data since single-cell counts are usually not normally distributed, which the t-test assumes. The Wilcoxon rank-sum test is also less sensitive to outliers.

The following are canonical markers that mark known cell types in this dataset. These can be used to identify what cell types are present and how they correspond to clusters.

| Cell Type | Markers |

| --------- | ------- |

| CD4 T Cells | IL7R |

| LYZCD14+ Monocytes | CD14, LYZ |

| B cells | MS4A1 |

| CD8 T cells | CD8A |

| FCGR3A+ Monocytes | FCGR3A, MS4A7 |

| NK cells | GNLY, NKG7 |

| Dendritic Cells | FCER1A, CST3 |

| Megakaryocytes | PPBP |

Step 4A: Build Cluster Marker Heatmap¶

Instructions

Examine the heatmap to understand how distinct each cluster is compared to others. Modify parameters to show more top markers or change the statistical test.

For 2,600 cells and 2,000 variable genes, this step will take approximately 15 to 30 seconds

genepattern.GPUIBuilder(

    analysis.visualize_top_markers,

    function_import='analysis.visualize_top_markers',

    name='Show Cluster Marker Heatmap',

    description='This step outputs an interactive interface to explore gene expression in clusters of cells.',

    parameters={

        'output_var': {

            'hide': True

        }

    }

)

Step 4B: Visualize Marker Expression¶

Instructions

Visualize the expression of gene(s) in each cell projected on the t-SNE map and the distribution across identified clusters. Provide any number of genes. If more than one gene is provided, the average expression of those genes will be shown.

For 2,600 cells and 2,000 variable genes, this step will take approximately 5 to 15 seconds

genepattern.GPUIBuilder(

    analysis.visualize_marker_expression,

    function_import='analysis.visualize_marker_expression',

    name='Visualize Marker Expression',

    description='This step outputs an interactive interface to explore gene expression in clusters of cells.',

    parameters={

        'output_var': {

            'hide': True

        }

    }

)

Step 4C: Compare Gene Expression Between Clusters¶

Instructions

Use the dropdowns to select which clusters to compare as well as the statistical test to compare them with. Mouse over points on the volcano plot to identify each gene's log2 fold change between clusters and the resulting -log10 p-value.

For 2,600 cells and 2,000 variable genes, this step will take approximately 5 to 15 seconds

genepattern.GPUIBuilder(

    analysis.compare_clusters,

    function_import='analysis.compare_clusters',

    name='Compare Gene Expression Between Clusters',

    description='This step outputs an interactive volcano plot comparing genes between different clusters of cells.',

    parameters={

        'output_var': {

            'hide': True

        }

    }

)

Step 5: Export Analysis Data¶

Relevant Files

X.csv - The preprocessed expression matrix of cells (rows) by genes (columns). This only includes variable genes, usually a much smaller subset of genes compared to the raw counts.
obs.csv - Cell annotations including # of genes, % mitochondrial genes, and cluster assignments
obsm.csv - Coordinates of cells in various dimensional reduction spaces (e.g., PCA, t-SNE).
var.csv - Gene annotations (of variable genes) including the # of cells, mean expression, dispersion, and normalized dispersion statistics.
varm.csv - Loadings of cells in various dimensional reduction spaces (e.g., PCA, t-SNE).
uns/pca_variance_ratio.csv - % variance explained by each principal component.
uns/rank_genes_groups_gene_names.csv - Names of the top ranked marker genes for each cluster.
uns/rank_genes_groups_gene_scores.csv - z-scores of the top ranked marker genes for each cluster.

Instructions

Using the cell below, export results as a series of .csv files or compressed .h5ad file. The files will be created and saved to the chosen local data folder.

After creating, you may want to upload to a ScienceData folder.

For 2,600 cells and 2,000 variable genes, this step will take approximately 10-30 seconds

dataSavePath = 'data/analysis'
dataUploadFolderUrl = 'https://sciencedata/files/tmp/'

genepattern.GPUIBuilder(

    analysis.export_data,

    function_import='analysis.export_data',

    name='Export Analysis Data',

    description='Export data as a series of .csv files or compressed .hda5 file.',

    parameters={

        'path': {

            'name': 'filepath',

            'description': 'Name of directory where file(s) will be saved. Exporting as an h5ad file produces a single file output.',

            'default': dataSavePath

        },

        'h5ad': {

            'name': 'file format',

            'description': 'Choose to save either as .csv files or as a single compressed .h5ad-formatted hdf5 file. It is recommended to export .csv files unless you know what you are doing.',

            'default': False,

            'choices': {

                'Comma-separated values (.csv)': False,

                'H5AD file (HDF5 file in the AnnData formatting convention)': True

            },

        },

        'output_var': {

            'hide': True

        }

    }

)

# Evaluate to create a zip archive and upload it to ScienceData

dirName = os.path.basename(dataSavePath.rstrip('/'))
parentName = os.path.dirname(dataSavePath.rstrip('/'))
os.system("cd "+parentName+"; if [ -d '"+dirName+"' ]; then zip -r "+dirName+".zip "+dirName+"; fi; curl --insecure --upload "+dirName+".zip "+dataUploadFolderUrl.rstrip('/')+'/'+dirName+".zip")

updating: analysis/ (stored 0%)
updating: analysis/uns/ (stored 0%)
updating: analysis/uns/louvain.csv (deflated 4%)
updating: analysis/uns/rank_genes_groups.csv (deflated 65%)
updating: analysis/uns/neighbors.csv (deflated 4%)
updating: analysis/uns/pca.csv (deflated 58%)
updating: analysis/var.csv (deflated 49%)
updating: analysis/obs.csv (deflated 63%)
updating: analysis/varm.csv (deflated 61%)
updating: analysis/X.csv (deflated 89%)
updating: analysis/obsm.csv (deflated 55%)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6732k    0     0  100 6732k      0  10.4M --:--:-- --:--:-- --:--:-- 10.4M

0

Results Interpretation¶

This analysis employs a scRNA-seq gene expression dataset for 2700 peripheral blood mononuclear cells (PBMCs) from a healthy donor as a demonstration of its use. We are able to recapitulate cell types; the clusters can be characterized by visualizing the expression of canonical markers of these cell types on the 2D t-SNE projection plot as done in Step 3 and Step 4.

In Step 4, the visualization of cluster marker expression helps us understand why some cells we identify visually as a single cluster are divided into multiple. For example, LYZ is overexpressed in a cloud of samples that clustering separates as two distinct clusters. The LYZ gene encodes for human lysozyme, an antimicrobial agent associated with blood monocytes. Using the cluster comparison tool, we can see that one of the subclusters exhibits high relative expression of CD14 while the other exhibits high relative expression of FCGR3A, also known as the CD16 receptor gene. These two genes characterize two known subtypes of blood monocytes respectively; classical and non-classical monocytes.

References¶

Satija, R., Farrell, J. A., Gennert, D., Schier, A. F., & Regev, A. (2015). Spatial reconstruction of single-cell gene expression data. Nature Biotechnology, 33, 495. http://dx.doi.org/10.1038/nbt.3192

Wolf, F. A., Angerer, P., & Theis, F. J. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19(1), 15. https://doi.org/10.1186/s13059-017-1382-0